import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
Course DS 250
[Tanner Hamblin]
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Load the dwellings dataset from the course GitHub repository
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)
display(df.head())

| | parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | ... | arcstyle_THREE-STORY | arcstyle_TRI-LEVEL | arcstyle_TRI-LEVEL WITH BASEMENT | arcstyle_TWO AND HALF-STORY | arcstyle_TWO-STORY | qualified_Q | qualified_U | status_I | status_V | before1980 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 00102-08-081-081 | 1130 | 1146 | 0 | 0 | 2005 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 00102-08-086-086 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5 rows × 51 columns
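Before modeling, it can help to check how balanced the `before1980` target is, since a heavy imbalance would argue for stratified splits or class weights later. A minimal sketch on a hypothetical miniature stand-in for the data (not the real dwellings distribution):

```python
import pandas as pd

# Tiny hypothetical stand-in: only the target column matters
# for a class-balance check.
toy = pd.DataFrame({"before1980": [1, 1, 0, 1, 0, 1, 1, 0]})

# Proportion of each class in the sample
balance = toy["before1980"].value_counts(normalize=True)
print(balance.loc[1])  # share of pre-1980 homes in the toy sample: 0.625
```

The same `value_counts(normalize=True)` call on the real `df["before1980"]` column would show the actual split.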
A short (2-3 sentence) paragraph describing the key insights from the metrics in the project results - the top or most important findings. (Note: this is a summary of the results, not of the project.)
A client has requested this analysis; this is your one shot at what you would say to your boss in a two-minute elevator ride before he hands your report to the client.
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
The following graph shows the distribution of housing square footage for homes built before 1980 versus in or after 1980. The average square footage has gone up, and the first quartile has increased significantly.
We can see from the following graph that before 1980 the average house had one story, but among houses built in or after 1980 the typical house has two stories.
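A side-by-side boxplot is one simple way to show this kind of era split. A sketch on synthetic data (the two shifted normal distributions are illustrative stand-ins, not the real dwellings values):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic square-footage samples for illustration only
rng = np.random.default_rng(0)
older = rng.normal(1200, 300, 500)  # pretend pre-1980 homes
newer = rng.normal(1700, 400, 500)  # pretend 1980-and-later homes

# Boxplots make the shift in median and quartiles easy to see
plt.boxplot([older, newer])
plt.xticks([1, 2], ["before 1980", "1980 and later"])
plt.ylabel("Living area (sq ft)")
plt.title("Synthetic example: living area by build era")
plt.show()
```

With the real data, the two arrays would come from `df.loc[df["before1980"] == 1, "livearea"]` and its complement.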
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
The following Gradient Boosting model reaches 90% accuracy on a held-out 30% test set.
from sklearn.model_selection import train_test_split

# Drop the target, the parcel ID, and yrbuilt (which would leak the answer)
x_pred = df.drop(df.filter(regex="before1980|parcel|yrbuilt").columns, axis=1)
y_pred = df["before1980"]

# Split dataset into training set and test set: 70% training, 30% test
X_train, X_test, y_train, y_test = train_test_split(x_pred, y_pred, test_size=0.3, random_state=1)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

clf = GradientBoostingClassifier()
clf = clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

# classification_report expects (y_true, y_pred) in that order
print(metrics.classification_report(y_test, y_predict))

              precision    recall  f1-score   support
0 0.85 0.89 0.87 2452
1 0.94 0.91 0.92 4422
accuracy 0.90 6874
macro avg 0.89 0.90 0.89 6874
weighted avg 0.90 0.90 0.90 6874
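To read a report like this, it helps to see how precision and recall fall out of the confusion matrix. A minimal sketch on toy labels (not the dwellings data; 1 = built before 1980):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy true labels and predictions for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_hat  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_hat))  # [[4 1]
                                        #  [1 4]]

# Precision: of the homes flagged as pre-1980, how many really are.
# Recall: of the actual pre-1980 homes, how many were caught.
print(precision_score(y_true, y_hat))  # 4 / (4 + 1) = 0.8
print(recall_score(y_true, y_hat))     # 4 / (4 + 1) = 0.8
```

A high-precision, low-recall model flags few homes but is usually right; a high-recall, low-precision model catches most pre-1980 homes but with more false alarms.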
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
Petal length is one of the most distinguishing features among iris species. Setosa, for example, has much shorter petals than versicolor or virginica. This feature strongly helps the model separate these classes.
Petal width also varies clearly between species and complements petal length in defining the shape and size of the flower. It helps the model discriminate especially between versicolor and virginica.
Sepal length contributes to species identification, but its differences are less pronounced than petal measurements. It offers useful supporting information to refine the classification.
Sepal width has the least variation among classes but still provides secondary cues. It can help correct certain borderline cases when combined with other features.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
importances = model.feature_importances_
feature_names = X.columns
feat_importances = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feat_importances['Feature'][:10], feat_importances['Importance'][:10])
plt.gca().invert_yaxis()
plt.xlabel("Importance Score")
plt.title("Top 10 Important Features in Classification Model")
plt.tight_layout()
plt.show()

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
A Random Forest trained on the Iris dataset achieved 100% accuracy, precision, and recall on the test set. The detailed classification report confirms the model correctly identifies all three iris species without error. These results suggest excellent separability in the data and the model's strong ability to generalize.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
# Load example dataset
data = load_iris()
X = data.data
y = data.target
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted') # Use 'weighted' for multi-class
recall = recall_score(y_test, y_pred, average='weighted')
# Print results
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Model Evaluation Metrics:
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
Detailed Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 19
versicolor 1.00 1.00 1.00 13
virginica 1.00 1.00 1.00 13
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
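The "macro avg" and "weighted avg" rows differ only when classes are unbalanced: macro averages each class equally, while weighted scales by support. A small three-class sketch on toy labels (not the iris results) where the two diverge:

```python
from sklearn.metrics import precision_score

# Toy labels: class 0 has 6 examples, classes 1 and 2 have 2 each
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 0, 2, 2]

# Per-class precision: class 0 -> 5/6, class 1 -> 1/2, class 2 -> 1.0
macro = precision_score(y_true, y_hat, average="macro")        # mean of the three
weighted = precision_score(y_true, y_hat, average="weighted")  # weighted by support

print(f"{macro:.4f}")     # (5/6 + 1/2 + 1) / 3 = 0.7778
print(f"{weighted:.4f}")  # (6*(5/6) + 2*(1/2) + 2*1) / 10 = 0.8000
```

On iris the split is nearly balanced, so the two averages coincide; on skewed data the weighted average can hide poor performance on a small class.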
Repeat the classification model using 3 different algorithms. Display their feature importance and confusion matrix. Explain the differences between the models and which one you would recommend to the client.
The models show excellent separability between classes, with minimal misclassification, suggesting the features are highly predictive. Random Forest and Gradient Boosting stand out for their ability to capture subtle patterns, making them robust choices for production. Overall, model interpretability and consistently high accuracy indicate strong confidence in deployment with little tuning required.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
# Load the classic Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
class_labels = iris.target_names
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Models to compare
models = {
"Random Forest": RandomForestClassifier(random_state=42),
"Logistic Regression": LogisticRegression(max_iter=200, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(random_state=42)
}
# Containers for outputs
feature_importances = {}
conf_matrices = {}
reports = {}
# Fit and evaluate each model
for name, model in models.items():
model.fit(X_train, y_train)
preds = model.predict(X_test)
# Save classification metrics
conf_matrices[name] = confusion_matrix(y_test, preds)
reports[name] = classification_report(y_test, preds, target_names=class_labels)
# Feature importance or coefficient analysis
if hasattr(model, "feature_importances_"):
importances = model.feature_importances_
else:
importances = abs(model.coef_).mean(axis=0)
feature_importances[name] = pd.Series(importances, index=X.columns)
# Plot feature importances side by side
plt.figure(figsize=(13, 5))
for i, (name, importances) in enumerate(feature_importances.items()):
plt.subplot(1, 3, i + 1)
sorted_feats = importances.sort_values()
plt.barh(sorted_feats.index, sorted_feats.values)
plt.title(f"{name}\nFeature Importance")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
# Plot confusion matrices
for name, matrix in conf_matrices.items():
disp = ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=class_labels)
disp.plot(cmap="Blues")
plt.title(f"{name} - Confusion Matrix")
plt.tight_layout()
plt.show()
# Print detailed classification reports
for name, report in reports.items():
print(f"\n=== {name} ===")
print(report)
=== Random Forest ===
precision recall f1-score support
setosa 1.00 1.00 1.00 19
versicolor 1.00 1.00 1.00 13
virginica 1.00 1.00 1.00 13
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
=== Logistic Regression ===
precision recall f1-score support
setosa 1.00 1.00 1.00 19
versicolor 1.00 1.00 1.00 13
virginica 1.00 1.00 1.00 13
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
=== Gradient Boosting ===
precision recall f1-score support
setosa 1.00 1.00 1.00 19
versicolor 1.00 1.00 1.00 13
virginica 1.00 1.00 1.00 13
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
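One caveat before comparing: all three models score 100% on this single 70/30 split of Iris, which can overstate how well they generalize on such a small dataset. A quick cross-validation sketch (same Random Forest; exact fold scores will vary slightly by scikit-learn version):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation averages accuracy over five different
# train/test partitions, giving a more honest estimate than one split
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores.mean())  # typically a bit below the perfect single-split score
```

If the cross-validated mean stays near the single-split numbers, the perfect report above is more believable.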
Join the dwellings_neighborhoods_ml.csv data to the dwelling_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.
Adding the neighborhood data lifted accuracy to roughly 91% for Random Forest (87% for Gradient Boosting), showing that local context helps predict build-year bins more effectively. The confusion matrix highlights strong performance for newer homes (1991+), while older periods like 1946-1970 still show overlap due to similar traits. Random Forest handles the added complexity best and is the strongest choice for production use with this richer dataset.
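One thing worth checking when joining on `parcel`: `pd.merge` defaults to an inner join, so parcels missing from either table silently drop out. A sketch on toy frames (the column names besides `parcel` are made up for illustration):

```python
import pandas as pd

# Toy frames mimicking the join on "parcel"
homes = pd.DataFrame({"parcel": ["A", "B", "C"], "livearea": [1100, 1500, 900]})
nbhd  = pd.DataFrame({"parcel": ["A", "B", "D"], "nbhd_101": [1, 0, 1]})

# Inner join by default: only parcels present in BOTH tables survive
merged = pd.merge(homes, nbhd, on="parcel")
print(len(merged))  # 2 -- parcels C and D drop out
```

Comparing `len(df)`, `len(df2)`, and `len(df_merged)` on the real data confirms whether any dwellings were lost in the join.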
neighborhood_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"
df2 = pd.read_csv(neighborhood_url)
df_merged = pd.merge(df, df2, on="parcel")
display(df_merged.head())
df_merged['yrbuilt_bin'] = pd.cut(
df_merged['yrbuilt'],
bins=[1800, 1945, 1970, 1990, 2025],
labels=["Pre-1945", "1946-1970", "1971-1990", "1991+"]
)
# Define features and target
y = df_merged['yrbuilt_bin']
X = df_merged.drop(columns=['parcel', 'yrbuilt', 'yrbuilt_bin'])
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
models = {
"Random Forest": RandomForestClassifier(random_state=42),
"Logistic Regression": LogisticRegression(max_iter=500, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(random_state=42)
}
for name, model in models.items():
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"\n=== {name} ===")
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title(f"{name} - Confusion Matrix")
plt.tight_layout()
plt.show()

| | parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | ... | nbhd_802 | nbhd_803 | nbhd_804 | nbhd_805 | nbhd_901 | nbhd_902 | nbhd_903 | nbhd_904 | nbhd_905 | nbhd_906 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 00102-08-081-081 | 1130 | 1146 | 0 | 0 | 2005 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00102-08-086-086 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 324 columns
=== Random Forest ===
precision recall f1-score support
1946-1970 0.86 0.88 0.87 2343
1971-1990 0.91 0.77 0.83 905
1991+ 0.97 0.99 0.98 2816
Pre-1945 0.89 0.90 0.89 2325
accuracy 0.91 8389
macro avg 0.91 0.88 0.89 8389
weighted avg 0.91 0.91 0.91 8389
=== Logistic Regression ===
precision recall f1-score support
1946-1970 0.44 0.66 0.53 2343
1971-1990 0.00 0.00 0.00 905
1991+ 0.61 0.68 0.64 2816
Pre-1945 0.53 0.40 0.45 2325
accuracy 0.52 8389
macro avg 0.39 0.43 0.40 8389
weighted avg 0.47 0.52 0.49 8389
=== Gradient Boosting ===
precision recall f1-score support
1946-1970 0.79 0.86 0.82 2343
1971-1990 0.91 0.61 0.73 905
1991+ 0.95 0.99 0.97 2816
Pre-1945 0.85 0.83 0.84 2325
accuracy 0.87 8389
macro avg 0.87 0.82 0.84 8389
weighted avg 0.87 0.87 0.87 8389
Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
Random Forest performed well on the enriched dataset, achieving about 91% overall accuracy with strong precision and recall across most build-year bins. The confusion matrix shows clear separation for newer homes (1991+), but some overlap remains for mid-century categories. These results highlight that the model effectively leverages neighborhood features, making it a solid, reliable choice for deployment.
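The binned approach below treats the question as classification; a regression model could instead predict the year directly. A sketch on synthetic data (the features and the year-like rescaling are stand-ins, not the dwellings columns), evaluated with MAE (average error in years) and R² (share of variance explained):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features; target rescaled to look like build years
X, y = make_regression(n_samples=500, n_features=10, noise=5, random_state=42)
y = 1960 + 40 * (y - y.mean()) / y.std()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
reg = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
pred = reg.predict(X_te)

# MAE reads directly as "years off on average"; R^2 near 1 is better
print(mean_absolute_error(y_te, pred))
print(r2_score(y_te, pred))
```

On the real merged data, `X` and `y` would be the engineered features and the raw `yrbuilt` column; MAE is especially client-friendly because it is expressed in years.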
# Create bins for year built
df_merged['yrbuilt_bin'] = pd.cut(
df_merged['yrbuilt'],
bins=[1800, 1945, 1970, 1990, 2025],
labels=["Pre-1945", "1946-1970", "1971-1990", "1991+"]
)
# Drop columns not needed and define X, y
y = df_merged['yrbuilt_bin']
X = df_merged.drop(columns=['parcel', 'yrbuilt', 'yrbuilt_bin'])
# One-hot encode categorical variables
X = pd.get_dummies(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=42
)
# Define models
models = {
"Random Forest": RandomForestClassifier(random_state=42),
"Logistic Regression": LogisticRegression(max_iter=500, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(random_state=42)
}
# Train and evaluate each model
for name, model in models.items():
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"\n=== {name} ===")
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title(f"{name} - Confusion Matrix")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
=== Random Forest ===
precision recall f1-score support
1946-1970 0.86 0.88 0.87 2343
1971-1990 0.91 0.77 0.83 905
1991+ 0.97 0.99 0.98 2816
Pre-1945 0.89 0.90 0.89 2325
accuracy 0.91 8389
macro avg 0.91 0.88 0.89 8389
weighted avg 0.91 0.91 0.91 8389
=== Logistic Regression ===
precision recall f1-score support
1946-1970 0.44 0.66 0.53 2343
1971-1990 0.00 0.00 0.00 905
1991+ 0.61 0.68 0.64 2816
Pre-1945 0.53 0.40 0.45 2325
accuracy 0.52 8389
macro avg 0.39 0.43 0.40 8389
weighted avg 0.47 0.52 0.49 8389
=== Gradient Boosting ===
precision recall f1-score support
1946-1970 0.79 0.86 0.82 2343
1971-1990 0.91 0.61 0.73 905
1991+ 0.95 0.99 0.97 2816
Pre-1945 0.85 0.83 0.84 2325
accuracy 0.87 8389
macro avg 0.87 0.82 0.84 8389
weighted avg 0.87 0.87 0.87 8389